Dataset Lookup

The objective of this case study is to show the advantages of Bayesian statistics for small data sets and its ability to estimate posterior parameters well. Bayesian statistics allows us to encode prior beliefs about our predictors as specific probability distributions. This is really useful when we do not have a lot of data but do have some insight into it. For this reason I have chosen a data set with 21 variables, and I will reduce the number of observations to simulate the small-data setting we want to study. The data is about COVID cases in Mexico, and the goal is to predict whether a patient has COVID or not.

[[https://www.kaggle.com/datasets/meirnizri/covid19-dataset][Dataset]]

rm(list = ls())
data = read.csv("data.csv", header = TRUE)
dim(data)
## [1] 1048575      21
summary(data)
##      USMER        MEDICAL_UNIT         SEX         PATIENT_TYPE  
##  Min.   :1.000   Min.   : 1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.: 4.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :12.000   Median :1.000   Median :1.000  
##  Mean   :1.632   Mean   : 8.981   Mean   :1.499   Mean   :1.191  
##  3rd Qu.:2.000   3rd Qu.:12.000   3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :13.000   Max.   :2.000   Max.   :2.000  
##   DATE_DIED            INTUBED        PNEUMONIA           AGE        
##  Length:1048575     Min.   : 1.00   Min.   : 1.000   Min.   :  0.00  
##  Class :character   1st Qu.:97.00   1st Qu.: 2.000   1st Qu.: 30.00  
##  Mode  :character   Median :97.00   Median : 2.000   Median : 40.00  
##                     Mean   :79.52   Mean   : 3.347   Mean   : 41.79  
##                     3rd Qu.:97.00   3rd Qu.: 2.000   3rd Qu.: 53.00  
##                     Max.   :99.00   Max.   :99.000   Max.   :121.00  
##     PREGNANT        DIABETES           COPD            ASTHMA      
##  Min.   : 1.00   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 2.00   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000  
##  Median :97.00   Median : 2.000   Median : 2.000   Median : 2.000  
##  Mean   :49.77   Mean   : 2.186   Mean   : 2.261   Mean   : 2.243  
##  3rd Qu.:97.00   3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 2.000  
##  Max.   :98.00   Max.   :98.000   Max.   :98.000   Max.   :98.000  
##     INMSUPR        HIPERTENSION    OTHER_DISEASE    CARDIOVASCULAR  
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000  
##  Median : 2.000   Median : 2.000   Median : 2.000   Median : 2.000  
##  Mean   : 2.298   Mean   : 2.129   Mean   : 2.435   Mean   : 2.262  
##  3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 2.000  
##  Max.   :98.000   Max.   :98.000   Max.   :98.000   Max.   :98.000  
##     OBESITY       RENAL_CHRONIC       TOBACCO       CLASIFFICATION_FINAL
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   :1.000       
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:3.000       
##  Median : 2.000   Median : 2.000   Median : 2.000   Median :6.000       
##  Mean   : 2.125   Mean   : 2.257   Mean   : 2.214   Mean   :5.306       
##  3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.:7.000       
##  Max.   :98.000   Max.   :98.000   Max.   :98.000   Max.   :7.000       
##       ICU       
##  Min.   : 1.00  
##  1st Qu.:97.00  
##  Median :97.00  
##  Mean   :79.55  
##  3rd Qu.:97.00  
##  Max.   :99.00

The raw data set consists of 21 features and 1,048,575 unique patients. In the Boolean features, 1 means “yes” and 2 means “no”; values such as 97, 98, and 99 encode missing data.
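As an illustrative sketch (not the approach used below, where keeping only factor levels 1 and 2 handles this implicitly), the sentinel codes could be recoded to NA directly:

```r
# Illustrative toy vector: recode the sentinel values 97, 98, 99 to NA.
x <- c(1, 2, 97, 2, 99)
x[x %in% c(97, 98, 99)] <- NA
x  # 1 2 NA 2 NA
```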

Above is a summary of the data. Before working with it we have to clean and adapt it. First, I will create the variable we want to predict: whether a patient has been diagnosed with COVID or not.

data$COVID = ifelse(data$CLASIFFICATION_FINAL <= 3, 1, 2)
data = subset(data, select = -c(CLASIFFICATION_FINAL))
convertToLogic = function(col.name, df) {
  # Recode a 1/2 ("yes"/"no") column to TRUE/FALSE
  index = which(names(df) == col.name)
  print(index)

  if (length(index) != 0) {
    df[, index] = ifelse(df[, index] == 2, 0, df[, index])
    df[, index] = as.logical(df[, index])
  }

  return(df)
}

This column tells us whether a patient has been diagnosed with COVID or not. Next, I will convert the remaining variables to properly labeled factors.

data = convertToLogic("COVID", data)
## [1] 21
data$USMER = ifelse(data$USMER == 2, 0, data$USMER)
data$USMER = as.logical(data$USMER)

data$MEDICAL_UNIT = factor(data$MEDICAL_UNIT)

data$SEX = factor(data$SEX, labels = c("female", "male"), levels = c(1, 2))

data$PATIENT_TYPE = factor(data$PATIENT_TYPE, labels = c("returned home", "hospitalized"), levels = c(1, 2))

data$INTUBED = factor(data$INTUBED, labels = c("intubed", "not intubed"), levels = c(1, 2))

data$PNEUMONIA = factor(data$PNEUMONIA, labels = c("pneumonia", "not pneumonia"), levels = c(1, 2))

data$PREGNANT = factor(data$PREGNANT, labels = c("pregnant", "not pregnant"), levels = c(1, 2))

data$DIABETES = factor(data$DIABETES, labels = c("diabetes", "not diabetes"), levels = c(1, 2))

data$COPD = factor(data$COPD, labels = c("copd", "not copd"), levels = c(1, 2))

data$ASTHMA = factor(data$ASTHMA, labels = c("asthma", "not asthma"), levels = c(1, 2))

data$INMSUPR = factor(data$INMSUPR, labels = c("inmsupr", "not inmsupr"), levels = c(1, 2))

data$HIPERTENSION = factor(data$HIPERTENSION, labels = c("hipertension", "not hipertension"), levels = c(1, 2))

data$OTHER_DISEASE = factor(data$OTHER_DISEASE, labels = c("other desease", "not other desease"), levels = c(1, 2))

data$CARDIOVASCULAR = factor(data$CARDIOVASCULAR, labels = c("cardiovascular", "not cardiovascular"), levels = c(1, 2))

data$OBESITY = factor(data$OBESITY, labels = c("obesity", "not obesity"), levels = c(1, 2))

data$RENAL_CHRONIC = factor(data$RENAL_CHRONIC, labels = c("renal chronic", "not renal chronic"), levels = c(1, 2))

data$TOBACCO = factor(data$TOBACCO, labels = c("tobacco", "not tobacco"), levels = c(1, 2))

data$ICU = factor(data$ICU, labels = c("icu", "not icu"), levels = c(1, 2))

data = subset(data, select = -c(DATE_DIED))
summary(data)
##    USMER          MEDICAL_UNIT        SEX                PATIENT_TYPE   
##  Mode :logical   12     :602995   female:525064   returned home:848544  
##  FALSE:662903    4      :314405   male  :523511   hospitalized :200031  
##  TRUE :385672    6      : 40584                                         
##                  9      : 38116                                         
##                  3      : 19175                                         
##                  8      : 10399                                         
##                  (Other): 22901                                         
##         INTUBED               PNEUMONIA           AGE        
##  intubed    : 33656   pneumonia    :140038   Min.   :  0.00  
##  not intubed:159050   not pneumonia:892534   1st Qu.: 30.00  
##  NA's       :855869   NA's         : 16003   Median : 40.00  
##                                              Mean   : 41.79  
##                                              3rd Qu.: 53.00  
##                                              Max.   :121.00  
##                                                              
##          PREGNANT              DIABETES            COPD        
##  pregnant    :  8131   diabetes    :124989   copd    :  15062  
##  not pregnant:513179   not diabetes:920248   not copd:1030510  
##  NA's        :527265   NA's        :  3338   NA's    :   3003  
##                                                                
##                                                                
##                                                                
##                                                                
##         ASTHMA               INMSUPR                  HIPERTENSION   
##  asthma    :  31572   inmsupr    :  14170   hipertension    :162729  
##  not asthma:1014024   not inmsupr:1031001   not hipertension:882742  
##  NA's      :   2979   NA's       :   3404   NA's            :  3104  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##            OTHER_DISEASE                CARDIOVASCULAR           OBESITY      
##  other desease    :  28040   cardiovascular    :  20769   obesity    :159816  
##  not other desease:1015490   not cardiovascular:1024730   not obesity:885727  
##  NA's             :   5045   NA's              :   3076   NA's       :  3032  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##            RENAL_CHRONIC            TOBACCO            ICU        
##  renal chronic    :  18904   tobacco    : 84376   icu    : 16858  
##  not renal chronic:1026665   not tobacco:960979   not icu:175685  
##  NA's             :   3006   NA's       :  3220   NA's   :856032  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##    COVID        
##  Mode :logical  
##  FALSE:656596   
##  TRUE :391979   
##                 
##                 
##                 
## 

Here we see that the data is correctly formatted, but there are some missing values, so let’s fix that.

Data Cleaning

First we will look at how many missing values there are per row, so we can then remove the columns that concentrate most of them.

print(length(which(is.na(data))))
## [1] 2288376
hist(rowMeans(is.na(data)), xlab = "Average of missing values by row", main = "")

Here we see that three columns concentrate most of the missing values.

indexesEmptyCols = which(colMeans(is.na(data)) != 0)

colsWithNA = sort(colMeans(is.na(data[, indexesEmptyCols])), 
                  decreasing = TRUE)

barplot(colsWithNA, las=2)

The columns with the most missing values are ICU, INTUBED, and PREGNANT, so let’s remove them.
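As an aside, the same removal can be expressed with a threshold on the fraction of missing values per column; a sketch with a toy data frame (the 30% cutoff is an arbitrary illustrative choice):

```r
# Keep only columns whose NA fraction is at most 30%.
df <- data.frame(a = c(1, NA, NA), b = 1:3)
df[, colMeans(is.na(df)) <= 0.3, drop = FALSE]  # keeps only column b
```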

data = subset(data, select = -c(ICU, INTUBED, PREGNANT))

print(length(which(is.na(data))))
## [1] 49210
data = na.omit(data)

length(unique(which(is.na(data))))
## [1] 0
summary(data)
##    USMER          MEDICAL_UNIT        SEX                PATIENT_TYPE   
##  Mode :logical   12     :591811   female:513216   returned home:833253  
##  FALSE:658255    4      :307177   male  :511936   hospitalized :191899  
##  TRUE :366897    6      : 37868                                         
##                  9      : 37384                                         
##                  3      : 18660                                         
##                  8      : 10097                                         
##                  (Other): 22155                                         
##          PNEUMONIA           AGE                 DIABETES     
##  pneumonia    :137599   Min.   :  0.00   diabetes    :122415  
##  not pneumonia:887553   1st Qu.: 30.00   not diabetes:902737  
##                         Median : 40.00                        
##                         Mean   : 41.89                        
##                         3rd Qu.: 53.00                        
##                         Max.   :121.00                        
##                                                               
##        COPD                ASTHMA              INMSUPR       
##  copd    :  14376   asthma    : 30497   inmsupr    :  13588  
##  not copd:1010776   not asthma:994655   not inmsupr:1011564  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##            HIPERTENSION              OTHER_DISEASE   
##  hipertension    :159577   other desease    : 27131  
##  not hipertension:865575   not other desease:998021  
##                                                      
##                                                      
##                                                      
##                                                      
##                                                      
##             CARDIOVASCULAR           OBESITY                 RENAL_CHRONIC    
##  cardiovascular    :  20126   obesity    :156961   renal chronic    :  18351  
##  not cardiovascular:1005026   not obesity:868191   not renal chronic:1006801  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##         TOBACCO         COVID        
##  tobacco    : 82675   Mode :logical  
##  not tobacco:942477   FALSE:636274   
##                       TRUE :388878   
##                                      
##                                      
##                                      
## 

Now the data set is clean, so let’s start working with it. But first we will shrink it to 1,000 observations.

data.small = data[sample(nrow(data), size=1000),]
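One caveat worth noting: sample() is random, so without fixing the RNG seed each run draws a different subset. A minimal sketch (the seed 42 is an arbitrary choice):

```r
# Fixing the seed makes the subsample reproducible.
set.seed(42)
idx1 <- sample(10, size = 3)
set.seed(42)
idx2 <- sample(10, size = 3)
identical(idx1, idx2)  # TRUE
```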

Bayesian Analysis of the covid variable

First of all, let’s plot a histogram of the COVID variable (the one we want to predict).

rm(list = setdiff(ls(), c("data", "data.small")))

hist(as.numeric(data$COVID))

This is as expected, since we are predicting a binary variable, so let’s use a Bernoulli distribution to model this data and see how well it fits. First, let’s compute the analytical posterior distribution of the COVID variable.

Analytical Study

  1. We assume a Bernoulli distribution for COVID, we will use X to denote that variable.

\[X \ | \ \theta \thicksim Bernoulli(\theta)\] \[f(x \ | \ \theta) = \theta^x \cdot (1 - \theta)^{1 - x}\]

  2. As we do not have any prior knowledge about the probability that a patient has COVID, we will use an improper prior. We choose a Beta distribution because it is conjugate to the Bernoulli likelihood, so the posterior will be much easier to work with.

\[\theta \thicksim Beta(0, 0)\] \[f(\theta) \propto \theta^{0 - 1} \cdot (1 - \theta)^{0 - 1}\]

  3. Now we obtain the likelihood

\[f(data \ | \ \theta) \propto \theta^{k} \cdot (1 - \theta)^{n - k}\]

where n is the total number of observations and k is the number of positive ones.

  4. And finally the posterior distribution

\[f(\theta \ | \ data) = \frac{\theta^{k - 1} \cdot (1 - \theta)^{n - k - 1}}{B(k, n - k)} \]

\[\theta \ | \ data \thicksim Beta(k, n - k)\]

Now that we have the posterior distribution, let’s obtain the predictive probability for a new observation Y given the data.

\[Y \ | \ \theta \thicksim Bernoulli(\theta)\] \[P ( Y = 1 \ | \ data) = \int_{0}^{1}{P(Y=1 \ | \ \theta) \cdot f(\theta \ | \ data) \ d\theta} = \frac{B(k + 1, n - k)}{B(k, n - k)} = \frac{k}{n}\]

n = as.numeric(length(data.small$COVID))
k = as.numeric(length(which(data.small$COVID)))
print(beta(k+1,n-k)/beta(k, n-k))
## [1] 0.38

Here we see that the predictive probability that a new patient has COVID is 0.38, which coincides exactly with the maximum-likelihood estimator k/n = 0.38, as expected since the improper Beta(0, 0) prior adds no pseudo-counts.
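The ratio of Beta functions simplifies analytically to k/n; a quick numerical check of this identity, using lbeta() on the log scale to avoid underflow of beta() for large n (k = 380, n = 1000 are illustrative values, not the sampled ones):

```r
# B(k + 1, n - k) / B(k, n - k) = k / n
n <- 1000
k <- 380
pred <- exp(lbeta(k + 1, n - k) - lbeta(k, n - k))
isTRUE(all.equal(pred, k / n))  # TRUE
```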

And finally, let’s obtain the same result numerically.

Numerical Study

Since we know the distribution of a new observation, we will draw a random sample and compare.

\[Y \ | \ \theta \thicksim Bernoulli(\theta), \quad \theta \ | \ data \thicksim Beta(k, n - k)\]

y.sample = rbinom(n, 1, rbeta(1, k, n - k))

mean(y.sample)
## [1] 0.36

Here we see that the estimated probability is almost the same as before.
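Note that the draw above uses a single theta for the whole sample; to fully integrate over the posterior we can instead draw a fresh theta per simulated patient. A sketch with illustrative values k = 380, n = 1000:

```r
set.seed(1)
n <- 1000; k <- 380
M <- 1e5                       # number of posterior predictive draws
theta <- rbeta(M, k, n - k)    # one theta per draw, from the posterior
y <- rbinom(M, 1, theta)       # Bernoulli outcome given each theta
mean(y)                        # close to k / n = 0.38
```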

covid.prob = rbeta(n, k, n - k)
quantile(covid.prob, probs = c(0.025, 0.975))
##      2.5%     97.5% 
## 0.3504863 0.4123275

We also see that the 95% credible interval for the probability of having COVID is fairly narrow, so the estimate is quite precise.

Data Exploration

Now we will see whether the other variables are useful for predicting if a patient has COVID or not.

rm(list = setdiff(ls(), c("data", "data.small")))

library(ggplot2) # GGally
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Let’s see how the columns are distributed with respect to the COVID variable.

ggplot(data, aes(x=COVID, y=AGE)) + 
  geom_boxplot()

Here we see a visible difference in AGE between COVID-positive and COVID-negative patients, so this can be a useful variable for our model.

plot(table(data$COVID, data$USMER))

plot(table(data$COVID, data$MEDICAL_UNIT))

plot(table(data$COVID, data$SEX))

plot(table(data$COVID, data$PATIENT_TYPE))

plot(table(data$COVID, data$PNEUMONIA))

plot(table(data$COVID, data$DIABETES))

plot(table(data$COVID, data$COPD))

plot(table(data$COVID, data$ASTHMA))

plot(table(data$COVID, data$INMSUPR))

plot(table(data$COVID, data$HIPERTENSION))

plot(table(data$COVID, data$OTHER_DISEASE))

plot(table(data$COVID, data$CARDIOVASCULAR))

plot(table(data$COVID, data$OBESITY))

plot(table(data$COVID, data$RENAL_CHRONIC))

plot(table(data$COVID, data$TOBACCO))
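The repeated table/plot calls above can be written as a single loop over the categorical predictors (a sketch; it assumes the `data` object with the factor columns created earlier):

```r
# Mosaic plots of COVID against every categorical predictor.
cat.vars <- c("USMER", "MEDICAL_UNIT", "SEX", "PATIENT_TYPE", "PNEUMONIA",
              "DIABETES", "COPD", "ASTHMA", "INMSUPR", "HIPERTENSION",
              "OTHER_DISEASE", "CARDIOVASCULAR", "OBESITY", "RENAL_CHRONIC",
              "TOBACCO")
for (v in cat.vars) {
  plot(table(data$COVID, data[[v]]), main = v)
}
```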

The columns that stand out the most are MEDICAL_UNIT, SEX, PATIENT_TYPE, and PNEUMONIA. This makes sense, and later we will check whether we can be confident that the difference is real.

Frequentist GLM

Now let’s fit a simple logistic regression model (GLM) to see how well we can predict whether a patient has COVID.

rm(list = setdiff(ls(), c("data")))

library(caret)
## Loading required package: lattice
library(lattice)

data.small = data[sample(nrow(data), size=10000),]

index.test = createDataPartition(data.small$COVID, p = 0.5, list = FALSE)

data.test = data.small[index.test,]
data.train = data.small[-index.test,]

rm(index.test)

First, let’s use all the variables to try to predict whether a patient has COVID or not.

fit = train(as.factor(COVID) ~ ., data = data.train, method = "glm", family = "binomial")

summary(fit)
## 
## Call:
## NULL
## 
## Coefficients: (1 not defined because of singularities)
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         -3.647697   1.271962  -2.868 0.004134 ** 
## USMERTRUE                            0.094284   0.065983   1.429 0.153031    
## MEDICAL_UNIT2                       -9.650933 196.971035  -0.049 0.960922    
## MEDICAL_UNIT3                        1.024084   1.170356   0.875 0.381563    
## MEDICAL_UNIT4                        1.164500   1.148457   1.014 0.310597    
## MEDICAL_UNIT5                        1.343421   1.195317   1.124 0.261054    
## MEDICAL_UNIT6                        1.016744   1.157167   0.879 0.379591    
## MEDICAL_UNIT7                        1.539165   1.567072   0.982 0.326005    
## MEDICAL_UNIT8                        1.175949   1.188158   0.990 0.322309    
## MEDICAL_UNIT9                        1.017897   1.158113   0.879 0.379441    
## MEDICAL_UNIT10                       1.568705   1.196400   1.311 0.189794    
## MEDICAL_UNIT11                       1.621831   1.205917   1.345 0.178659    
## MEDICAL_UNIT12                       1.018086   1.148563   0.886 0.375402    
## MEDICAL_UNIT13                             NA         NA      NA       NA    
## SEXmale                              0.208295   0.061110   3.409 0.000653 ***
## PATIENT_TYPEhospitalized             0.379112   0.103549   3.661 0.000251 ***
## `PNEUMONIAnot pneumonia`            -0.724558   0.111909  -6.475 9.51e-11 ***
## AGE                                  0.012094   0.002122   5.701 1.19e-08 ***
## `DIABETESnot diabetes`              -0.036152   0.102216  -0.354 0.723578    
## `COPDnot copd`                       0.109181   0.270431   0.404 0.686413    
## `ASTHMAnot asthma`                  -0.043505   0.174163  -0.250 0.802745    
## `INMSUPRnot inmsupr`                 0.333857   0.262762   1.271 0.203883    
## `HIPERTENSIONnot hipertension`      -0.137879   0.095191  -1.448 0.147492    
## `OTHER_DISEASEnot other desease`     0.656741   0.206604   3.179 0.001479 ** 
## `CARDIOVASCULARnot cardiovascular`   0.777438   0.251359   3.093 0.001982 ** 
## `OBESITYnot obesity`                -0.285248   0.083996  -3.396 0.000684 ***
## `RENAL_CHRONICnot renal chronic`     0.326682   0.226655   1.441 0.149495    
## `TOBACCOnot tobacco`                 0.260149   0.114280   2.276 0.022821 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6633.7  on 4999  degrees of freedom
## Residual deviance: 6291.0  on 4973  degrees of freedom
## AIC: 6345
## 
## Number of Fisher Scoring iterations: 10

Here we see that many variables are useless, as the p-values of their coefficients are really high.

confusionMatrix(as.factor(data.test$COVID), predict(fit, newdata = data.test))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  2803  304
##      TRUE   1411  482
##                                           
##                Accuracy : 0.657           
##                  95% CI : (0.6437, 0.6702)
##     No Information Rate : 0.8428          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.177           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6652          
##             Specificity : 0.6132          
##          Pos Pred Value : 0.9022          
##          Neg Pred Value : 0.2546          
##              Prevalence : 0.8428          
##          Detection Rate : 0.5606          
##    Detection Prevalence : 0.6214          
##       Balanced Accuracy : 0.6392          
##                                           
##        'Positive' Class : FALSE           
## 

Here we get an accuracy of 0.657, which is not that bad. This is probably because only a few variables are actually significant, as we saw in the exploration plots. So let’s try a simpler model.

fit = train(as.factor(COVID) ~ USMER + PNEUMONIA + MEDICAL_UNIT + DIABETES + HIPERTENSION + AGE + PATIENT_TYPE, data = data.train, method = "glm", family = "binomial")

summary(fit)
## 
## Call:
## NULL
## 
## Coefficients: (1 not defined because of singularities)
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.377997   1.166096  -1.182 0.237317    
## USMERTRUE                        0.082135   0.065546   1.253 0.210174    
## `PNEUMONIAnot pneumonia`        -0.732420   0.110769  -6.612 3.79e-11 ***
## MEDICAL_UNIT2                   -9.775325 196.971007  -0.050 0.960419    
## MEDICAL_UNIT3                    0.965563   1.165979   0.828 0.407606    
## MEDICAL_UNIT4                    1.128609   1.144594   0.986 0.324116    
## MEDICAL_UNIT5                    1.317184   1.190936   1.106 0.268723    
## MEDICAL_UNIT6                    0.994307   1.153313   0.862 0.388616    
## MEDICAL_UNIT7                    1.540980   1.542823   0.999 0.317889    
## MEDICAL_UNIT8                    1.274316   1.183086   1.077 0.281430    
## MEDICAL_UNIT9                    0.993985   1.154172   0.861 0.389122    
## MEDICAL_UNIT10                   1.599234   1.191857   1.342 0.179661    
## MEDICAL_UNIT11                   1.687446   1.201937   1.404 0.160337    
## MEDICAL_UNIT12                   1.003887   1.144657   0.877 0.380476    
## MEDICAL_UNIT13                         NA         NA      NA       NA    
## `DIABETESnot diabetes`          -0.019602   0.100831  -0.194 0.845859    
## `HIPERTENSIONnot hipertension`  -0.132394   0.091982  -1.439 0.150051    
## AGE                              0.011505   0.002089   5.507 3.65e-08 ***
## PATIENT_TYPEhospitalized         0.336352   0.101510   3.313 0.000921 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6633.7  on 4999  degrees of freedom
## Residual deviance: 6344.4  on 4982  degrees of freedom
## AIC: 6380.4
## 
## Number of Fisher Scoring iterations: 10

The model is simpler now, but MEDICAL_UNIT, for example, still has no significant levels, and level 13 is not even estimable.

confusionMatrix(as.factor(data.test$COVID), predict(fit, newdata = data.test))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  2793  314
##      TRUE   1432  461
##                                          
##                Accuracy : 0.6508         
##                  95% CI : (0.6374, 0.664)
##     No Information Rate : 0.845          
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.161          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.6611         
##             Specificity : 0.5948         
##          Pos Pred Value : 0.8989         
##          Neg Pred Value : 0.2435         
##              Prevalence : 0.8450         
##          Detection Rate : 0.5586         
##    Detection Prevalence : 0.6214         
##       Balanced Accuracy : 0.6280         
##                                          
##        'Positive' Class : FALSE          
## 

Here we see that the accuracy and the kappa are almost the same, so we have not lost much information.
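For reference, the frequentist model does provide asymptotic (Wald) confidence intervals for the coefficients; a self-contained sketch with a toy glm (for the caret object above, the underlying model is exposed as fit$finalModel):

```r
# Toy logistic regression: 95% Wald confidence intervals for the coefficients.
toy <- data.frame(y = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 0),
                  x = c(1, 2, 3, 4, 1, 5, 2, 4, 5, 1))
m <- glm(y ~ x, family = binomial, data = toy)
confint.default(m)  # rows: (Intercept), x
```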

Bayesian GLM

The frequentist approach is easier, but it only gives point estimates and asymptotic confidence intervals for the parameters, not their full distribution. That is why we will use the Bayesian approach: we obtain the posterior distribution of each parameter, which lets us study its relation to COVID and its significance in more detail. So let’s start.

library(coda)
library(MASS)
library(MCMCpack)
rm(list = setdiff(ls(), c("data", "data.small", "data.test", "data.train")))

fit = MCMClogit(COVID ~ ., data = data.train, burnin=1000, mcmc=210000)
par(mar=c(1, 1, 1, 1))
plot(fit)

summary(fit)
## 
## Iterations = 1001:211000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 210000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                                        Mean        SD  Naive SE Time-series SE
## (Intercept)                      -376.25257 2.097e+02 4.577e-01      12.125949
## USMERTRUE                           0.08932 6.701e-02 1.462e-04       0.003578
## MEDICAL_UNIT3                     373.57969 2.097e+02 4.577e-01      12.128977
## MEDICAL_UNIT4                     373.71923 2.097e+02 4.577e-01      12.129152
## MEDICAL_UNIT5                     373.89959 2.098e+02 4.577e-01      12.129814
## MEDICAL_UNIT6                     373.57257 2.098e+02 4.577e-01      12.129274
## MEDICAL_UNIT7                     374.08585 2.097e+02 4.576e-01      12.123135
## MEDICAL_UNIT8                     373.71724 2.098e+02 4.578e-01      12.132534
## MEDICAL_UNIT9                     373.57129 2.097e+02 4.577e-01      12.128097
## MEDICAL_UNIT10                    374.10892 2.098e+02 4.578e-01      12.130807
## MEDICAL_UNIT11                    374.17674 2.097e+02 4.577e-01      12.127994
## MEDICAL_UNIT12                    373.57465 2.097e+02 4.577e-01      12.129264
## MEDICAL_UNIT13                    372.12884 2.098e+02 4.578e-01      12.135588
## SEXmale                             0.20522 6.008e-02 1.311e-04       0.003188
## PATIENT_TYPEhospitalized            0.39305 1.037e-01 2.262e-04       0.005817
## PNEUMONIAnot pneumonia             -0.72429 1.056e-01 2.304e-04       0.005255
## AGE                                 0.01228 2.068e-03 4.513e-06       0.000108
## DIABETESnot diabetes               -0.03273 1.008e-01 2.199e-04       0.005151
## COPDnot copd                        0.13334 2.682e-01 5.852e-04       0.014348
## ASTHMAnot asthma                   -0.05817 1.739e-01 3.795e-04       0.009339
## INMSUPRnot inmsupr                  0.36506 2.575e-01 5.619e-04       0.012960
## HIPERTENSIONnot hipertension       -0.12582 9.500e-02 2.073e-04       0.005312
## OTHER_DISEASEnot other desease      0.68088 2.039e-01 4.449e-04       0.011037
## CARDIOVASCULARnot cardiovascular    0.78124 2.490e-01 5.433e-04       0.012709
## OBESITYnot obesity                 -0.28849 8.122e-02 1.772e-04       0.004159
## RENAL_CHRONICnot renal chronic      0.30085 2.270e-01 4.954e-04       0.012171
## TOBACCOnot tobacco                  0.24477 1.182e-01 2.579e-04       0.006579
## 
## 2. Quantiles for each variable:
## 
##                                        2.5%        25%        50%        75%
## (Intercept)                      -6.989e+02 -553.95676 -397.00642 -203.67551
## USMERTRUE                        -3.368e-02    0.04431    0.08627    0.13049
## MEDICAL_UNIT3                     1.068e+01  200.71615  394.02809  551.60896
## MEDICAL_UNIT4                     1.082e+01  200.97653  394.32868  551.57777
## MEDICAL_UNIT5                     1.099e+01  201.19860  394.67356  551.62641
## MEDICAL_UNIT6                     1.067e+01  200.82866  394.19995  551.33743
## MEDICAL_UNIT7                     1.119e+01  201.96328  393.72363  551.66027
## MEDICAL_UNIT8                     1.083e+01  200.91157  394.29261  551.70893
## MEDICAL_UNIT9                     1.067e+01  200.91928  394.10903  551.34237
## MEDICAL_UNIT10                    1.122e+01  201.73805  394.22645  551.86116
## MEDICAL_UNIT11                    1.127e+01  201.95237  395.25460  551.53600
## MEDICAL_UNIT12                    1.067e+01  200.80387  394.18120  551.41278
## MEDICAL_UNIT13                    9.651e+00  198.92662  393.91489  551.86470
## SEXmale                           8.722e-02    0.16646    0.20728    0.24744
## PATIENT_TYPEhospitalized          1.831e-01    0.32095    0.39396    0.46260
## PNEUMONIAnot pneumonia           -9.458e-01   -0.78969   -0.73216   -0.65065
## AGE                               8.157e-03    0.01083    0.01223    0.01357
## DIABETESnot diabetes             -2.453e-01   -0.09810   -0.03524    0.02929
## COPDnot copd                     -3.799e-01   -0.03879    0.12502    0.31735
## ASTHMAnot asthma                 -4.142e-01   -0.16743   -0.06236    0.05590
## INMSUPRnot inmsupr               -1.223e-01    0.20092    0.34731    0.54010
## HIPERTENSIONnot hipertension     -3.147e-01   -0.18901   -0.12420   -0.05887
## OTHER_DISEASEnot other desease    2.751e-01    0.54139    0.67796    0.80816
## CARDIOVASCULARnot cardiovascular  3.224e-01    0.59946    0.77744    0.93374
## OBESITYnot obesity               -4.472e-01   -0.33835   -0.28525   -0.23332
## RENAL_CHRONICnot renal chronic   -1.158e-01    0.15806    0.31769    0.42816
## TOBACCOnot tobacco                2.816e-02    0.16188    0.24535    0.32309
##                                      97.5%
## (Intercept)                      -13.29863
## USMERTRUE                          0.22965
## MEDICAL_UNIT3                    695.83160
## MEDICAL_UNIT4                    696.46238
## MEDICAL_UNIT5                    696.62088
## MEDICAL_UNIT6                    696.60999
## MEDICAL_UNIT7                    696.83407
## MEDICAL_UNIT8                    696.28339
## MEDICAL_UNIT9                    696.12496
## MEDICAL_UNIT10                   697.04661
## MEDICAL_UNIT11                   696.39288
## MEDICAL_UNIT12                   696.31882
## MEDICAL_UNIT13                   695.16664
## SEXmale                            0.32226
## PATIENT_TYPEhospitalized           0.58958
## PNEUMONIAnot pneumonia            -0.51199
## AGE                                0.01607
## DIABETESnot diabetes               0.16451
## COPDnot copd                       0.65960
## ASTHMAnot asthma                   0.28862
## INMSUPRnot inmsupr                 0.85074
## HIPERTENSIONnot hipertension       0.05357
## OTHER_DISEASEnot other desease     1.09955
## CARDIOVASCULARnot cardiovascular   1.29783
## OBESITYnot obesity                -0.13603
## RENAL_CHRONICnot renal chronic     0.78549
## TOBACCOnot tobacco                 0.46551

From the Bayesian point of view, the 95% credible intervals tell a more nuanced story than "everything is significant": the intervals for SEX, PATIENT_TYPE, PNEUMONIA, AGE, OTHER_DISEASE, CARDIOVASCULAR, OBESITY and TOBACCO exclude 0, while those for DIABETES, COPD, ASTHMA, INMSUPR, HIPERTENSION and RENAL_CHRONIC contain 0, so not all predictors are credibly relevant at the 5% level.
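As a hedged illustration, this interval check can be automated on a matrix of posterior draws. The draws below are simulated stand-ins that roughly mimic the AGE and DIABETES posteriors above; they are not the actual sampler output:

```r
# Sketch: flag coefficients whose 95% credible interval excludes zero.
# 'draws' is a simulated (samples x coefficients) stand-in, not the real chain.
set.seed(1)
draws <- cbind(AGE      = rnorm(1000, mean = 0.012, sd = 0.002),
               DIABETES = rnorm(1000, mean = -0.035, sd = 0.10))
ci <- apply(draws, 2, quantile, probs = c(0.025, 0.975))
ci[1, ] > 0 | ci[2, ] < 0   # TRUE where the interval excludes 0: AGE TRUE, DIABETES FALSE
```

The same two lines applied to the real posterior sample matrix would reproduce the significance reading above.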

Lasso

rm(list = setdiff(ls(), c("data", "data.small", "data.test", "data.train")))

library(monomvn)

x = data.frame(lapply(subset(data.train, select = -c(COVID)), function(x) as.numeric((x))))

adaptTo0And1 = function(col.name, df) {
  index = which(names(df) == col.name)

  if (length(index) != 0) {
    df[, index] = ifelse(df[, index] == 2, 0, df[, index])
  }
  
  return(df)
}
binary.cols = c("SEX", "PATIENT_TYPE", "PNEUMONIA", "DIABETES", "COPD",
                "ASTHMA", "INMSUPR", "HIPERTENSION", "OTHER_DISEASE",
                "CARDIOVASCULAR", "OBESITY", "RENAL_CHRONIC", "TOBACCO")
for (col in binary.cols) {
  x = adaptTo0And1(col, x)
}

summary(x)
##      USMER         MEDICAL_UNIT         SEX          PATIENT_TYPE   
##  Min.   :0.0000   Min.   : 2.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 4.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :12.000   Median :1.0000   Median :1.0000  
##  Mean   :0.3474   Mean   : 8.991   Mean   :0.5066   Mean   :0.8098  
##  3rd Qu.:1.0000   3rd Qu.:12.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :13.000   Max.   :1.0000   Max.   :1.0000  
##    PNEUMONIA           AGE           DIABETES           COPD       
##  Min.   :0.0000   Min.   : 0.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:30.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :41.00   Median :0.0000   Median :0.0000  
##  Mean   :0.1392   Mean   :41.79   Mean   :0.1216   Mean   :0.0134  
##  3rd Qu.:0.0000   3rd Qu.:52.00   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :99.00   Max.   :1.0000   Max.   :1.0000  
##      ASTHMA          INMSUPR        HIPERTENSION    OTHER_DISEASE   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.0314   Mean   :0.0148   Mean   :0.1552   Mean   :0.0258  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  CARDIOVASCULAR      OBESITY       RENAL_CHRONIC       TOBACCO      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.0178   Mean   :0.1578   Mean   :0.0198   Mean   :0.0816  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
y = data.train$COVID
fit = blasso(x, y, mprior = c(0,1))
## t=100, m=14
## t=200, m=9
## t=300, m=12
## t=400, m=10
## t=500, m=12
## t=600, m=13
## t=700, m=12
## t=800, m=10
## t=900, m=9

After fitting the model, let's check the chain for stability.

plot(fit, burnin=200, which="m")

acf(fit$m)

Here we see that there is some autocorrelation and the chain is not stable enough, so to be safe let's draw many more samples and apply some thinning.
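A rough way to quantify what that autocorrelation costs is an effective-sample-size estimate. This is a sketch using a crude truncated positive-autocorrelation estimator on the model-size trace; coda::effectiveSize is the standard tool for this:

```r
# Crude effective sample size: n / (1 + 2 * sum of positive autocorrelations)
rho <- acf(fit$m, plot = FALSE)$acf[-1]              # lag-1, lag-2, ... autocorrelations
ess <- length(fit$m) / (1 + 2 * sum(pmax(rho, 0)))
ess  # much smaller than length(fit$m) when mixing is poor
```

Thinning trades raw samples for less correlated ones, which is what the refit below does.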

set.seed(111)
fit = blasso(x, y, mprior = c(0,1), T = 10000, thin = 20)
## t=100, m=9
## t=200, m=13
## t=300, m=11
## t=400, m=11
## t=500, m=12
## (progress output for t=600 through t=9800 omitted; m stays between 7 and 14)
## t=9900, m=14
plot(fit, burnin=1000, which="m")

acf(fit$m)

Now we do not see any periodicity and the chain is much more stable.

plot(fit, burnin=1000, which="s2")

plot(fit, burnin=1000, which="lambda2")

These traces also look stable, so we can trust that the chain has converged. Now let's look at the most important variables.

# (The vignette-style overlay of lasso/least-squares points is omitted here:
#  a blasso fit has no $b component, so those points() calls drew nothing.)
plot(fit, burnin=1000)

s <- summary(fit, burnin=1000)
print(s$bn0) # probability that each beta coef != zero
##       b.1       b.2       b.3       b.4       b.5       b.6       b.7       b.8 
## 0.3692222 0.6027778 0.9753333 0.9996667 1.0000000 1.0000000 0.2935556 0.2718889 
##       b.9      b.10      b.11      b.12      b.13      b.14      b.15      b.16 
## 0.2511111 0.4492222 0.4927778 0.9616667 0.9592222 0.9837778 0.4012222 0.7170000
barplot(s$bn0, horiz = TRUE)

Here we see that the most important variables are SEX, PATIENT_TYPE, PNEUMONIA, AGE and CARDIOVASCULAR (OTHER_DISEASE and OBESITY also show high inclusion probabilities). All in all, we can conclude that the most important variables for predicting whether a patient has COVID are:

  • Sex: as we will see, one of the sexes is more prone to have COVID.
  • Patient type: this makes sense, because a patient who has been hospitalized is more likely to have COVID.
  • Age: older people are more likely to have COVID.
  • Pneumonia: since COVID causes symptoms similar to pneumonia, patients with pneumonia are much more likely to have it.
  • Cardiovascular: the lasso keeps this variable, although as we will see the direction of its effect is surprising.
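A reproducible way to make this selection is to threshold the posterior inclusion probabilities. In this sketch the bn0 values printed above are re-entered by hand and mapped to column names in the order of summary(x); the 0.9 cutoff is an arbitrary choice:

```r
# Posterior inclusion probabilities from s$bn0, named by column of x
bn0 <- c(USMER = 0.369, MEDICAL_UNIT = 0.603, SEX = 0.975, PATIENT_TYPE = 1.000,
         PNEUMONIA = 1.000, AGE = 1.000, DIABETES = 0.294, COPD = 0.272,
         ASTHMA = 0.251, INMSUPR = 0.449, HIPERTENSION = 0.493,
         OTHER_DISEASE = 0.962, CARDIOVASCULAR = 0.959, OBESITY = 0.984,
         RENAL_CHRONIC = 0.401, TOBACCO = 0.717)
names(bn0)[bn0 > 0.9]
# "SEX" "PATIENT_TYPE" "PNEUMONIA" "AGE" "OTHER_DISEASE" "CARDIOVASCULAR" "OBESITY"
```

In practice one would compute `colnames(x)[s$bn0 > 0.9]` directly on the fitted summary object.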

Final Model

Finally, we will build a model with the variables selected by the lasso.

rm(list = setdiff(ls(), c("data", "data.small", "data.test", "data.train")))

library(R2OpenBUGS)

logit.bayes <- function(){
  for( i in 1 : n ) {
    COVID.bin[i] ~ dbern(p[i])
    logit(p[i]) <- b0 + b1*AGE[i] + b2*SEX.male[i] + b3*PATIENT_TYPE.hospitalized[i] + b4*PNEUMONIA.yes[i] + b5*CARDIOVASCULAR.yes[i]
  }
  
  # Vague normal priors on all coefficients
  b0 ~ dnorm(0.0, 1.0E-6)
  b1 ~ dnorm(0.0, 1.0E-6)
  b2 ~ dnorm(0.0, 1.0E-6)
  b3 ~ dnorm(0.0, 1.0E-6)
  b4 ~ dnorm(0.0, 1.0E-6)
  b5 ~ dnorm(0.0, 1.0E-6)
}
COVID.bin = ifelse(data.train$COVID, 1, 0)
n = length(COVID.bin)
SEX.male = ifelse(data.train$SEX == "male", 1, 0)
PATIENT_TYPE.hospitalized = ifelse(data.train$PATIENT_TYPE == "hospitalized", 1, 0)
PNEUMONIA.yes = ifelse(data.train$PNEUMONIA == "pneumonia", 1, 0)
CARDIOVASCULAR.yes = ifelse(data.train$CARDIOVASCULAR == "cardiovascular", 1, 0)

# Use a dedicated name so we do not overwrite the original 'data' data frame
bugs.data <- list(n = n, COVID.bin = COVID.bin, AGE = data.train$AGE, SEX.male = SEX.male, PATIENT_TYPE.hospitalized = PATIENT_TYPE.hospitalized, PNEUMONIA.yes = PNEUMONIA.yes, CARDIOVASCULAR.yes = CARDIOVASCULAR.yes)

inits <- function(){
  list(b0 = 1, b1 = 0, b2 = 0, b3 = 0, b4 = 0, b5 = 0)
}
output <- bugs(data = bugs.data, inits = inits, parameters.to.save = c("b0", "b1", "b2", "b3", "b4", "b5"), model.file = logit.bayes, n.chains = 1, n.burnin = 100, n.iter = 1000)
output
## Inference for Bugs model at "/tmp/RtmpgctoDl/model1577061a476ce.txt", 
## Current: 1 chains, each with 1000 iterations (first 100 discarded)
## Cumulative: n.sims = 900 iterations saved
##            mean  sd   2.5%    25%    50%    75%  97.5%
## b0         -1.3 0.1   -1.5   -1.4   -1.3   -1.3   -1.1
## b1          0.0 0.0    0.0    0.0    0.0    0.0    0.0
## b2          0.2 0.1    0.1    0.2    0.2    0.2    0.3
## b3          0.4 0.1    0.2    0.3    0.4    0.5    0.6
## b4          0.7 0.1    0.5    0.7    0.7    0.8    0.9
## b5         -0.8 0.2   -1.3   -1.0   -0.8   -0.6   -0.3
## deviance 6345.8 3.6 6341.0 6343.0 6345.0 6348.0 6354.5
## 
## DIC info (using the rule, pD = Dbar-Dhat)
## pD = 6.1 and DIC = 6352.0
## DIC is an estimate of expected predictive error (lower deviance is better).
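The logit coefficients are easier to interpret as odds ratios. A quick sketch using the posterior means from the table above (not a new computation on the chains):

```r
# exp(beta) is the multiplicative change in the odds of COVID per unit of the predictor
exp(c(b2.male = 0.2, b3.hospitalized = 0.4, b4.pneumonia = 0.7, b5.cardiovascular = -0.8))
# roughly 1.22, 1.49, 2.01 and 0.45 times the baseline odds, respectively
```

Applying exp() to the full sims.list draws would give posterior distributions of these odds ratios instead of point values.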
b0.post <-output$sims.list$b0
b1.post <-output$sims.list$b1
b2.post <-output$sims.list$b2
b3.post <-output$sims.list$b3
b4.post <-output$sims.list$b4
b5.post <-output$sims.list$b5
ts.plot(b0.post)

acf(b0.post)

ts.plot(b1.post)

acf(b1.post)

ts.plot(b2.post)

acf(b2.post)

ts.plot(b3.post)

acf(b3.post)

ts.plot(b4.post)

acf(b4.post)

ts.plot(b5.post)

acf(b5.post)

We see that the chains are stable, so let's examine some predictions. First, let's create the baseline (intercept only: age 0, female, not hospitalized, no pneumonia, no cardiovascular disease).

linear = b0.post
pred.baseline = exp(linear)/(1+exp(linear))

mean(pred.baseline)
## [1] 0.209602
quantile(pred.baseline,c(0.025,0.975))
##      2.5%     97.5% 
## 0.1801214 0.2427063

Here we see that a baseline person has a probability of about 0.21 of having COVID. So now let's compare it to other groups.

Age

20 years old

linear = b0.post+b1.post * 20
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.2566922
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.2346928 0.2799761

50 years old

linear = b0.post+b1.post * 50
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.3395505
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.3173965 0.3606602

80 years old

linear = b0.post+b1.post * 80
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.4337181
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.3900243 0.4776586

Here we can clearly see that the older the patient, the higher the probability of having COVID.
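Because we keep the full posterior draws, this comparison can be stated as a posterior probability rather than as a comparison of point estimates. A minimal sketch reusing the b0.post and b1.post samples extracted above:

```r
# plogis() is R's built-in inverse logit, equivalent to exp(x)/(1+exp(x))
p20 <- plogis(b0.post + b1.post * 20)
p80 <- plogis(b0.post + b1.post * 80)
mean(p80 - p20)   # posterior mean increase in probability from age 20 to age 80
mean(p80 > p20)   # posterior probability that the 80-year-old's risk is higher
```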

Sex

Female

linear = b0.post
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.209602
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.1801214 0.2427063

Male

linear = b0.post+b2.post * 1
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.2430516
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.2119389 0.2799126

We see that the male group has a somewhat higher predicted probability of having COVID. The two predicted-probability intervals overlap, but note that the 95% posterior interval for b2 itself (roughly 0.1 to 0.3) excludes 0, so the difference, while small, is credible.
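This impression can be quantified directly from the draws instead of by eyeballing overlapping intervals. A short sketch reusing the b0.post and b2.post samples:

```r
mean(b2.post > 0)   # posterior probability that being male raises the log-odds
# Posterior of the male-minus-female difference in probability
diff.post <- plogis(b0.post + b2.post) - plogis(b0.post)
quantile(diff.post, c(0.025, 0.975))
```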

Patient Type

Not hospitalized

linear = b0.post
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.209602
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.1801214 0.2427063

Hospitalized

linear = b0.post+b3.post * 1
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.2867728
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.2333096 0.3457791

Now, this difference is clearly significant and it makes sense: if a patient has been hospitalized, it is quite likely that he or she has COVID.

Pneumonia

Not pneumonia

linear = b0.post
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.209602
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.1801214 0.2427063

Pneumonia

linear = b0.post+b4.post * 1
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.3547374
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.2958420 0.4179099

The same conclusion holds for pneumonia.

Cardiovascular

Not cardiovascular

linear = b0.post
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.209602
quantile(pred.prob,c(0.025,0.975))
##      2.5%     97.5% 
## 0.1801214 0.2427063

Cardiovascular

linear = b0.post+b5.post * 1
pred.prob = exp(linear)/(1+exp(linear))

mean(pred.prob)
## [1] 0.1104778
quantile(pred.prob,c(0.025,0.975))
##       2.5%      97.5% 
## 0.06538988 0.17185446

For cardiovascular, the posterior for b5 is credibly negative (95% interval roughly -1.3 to -0.3), so patients with cardiovascular disease actually show a lower predicted probability of COVID. Since a protective effect is counter-intuitive, we will not draw strong conclusions from this variable.

Conclusions

We have seen that many variables in this data set are not useful for predicting whether a patient has COVID, but with the Bayesian approach and the lasso we have found some that are significant (in this case: age, patient type, and pneumonia), which makes sense. The real power of the Bayesian approach is that we obtain the full posterior distribution, so we can see not only whether a variable matters but also how much we can trust it: for example, the effects for sex and cardiovascular were either small or counter-intuitive, so we discarded them. This shows the usefulness of the Bayesian approach in comparison with the frequentist one.